OK, first of all, Happy New (Civil) Year to everybody. Then, I'd like to note that I greatly enjoyed the Israeli 2007 Perl Workshop, which I attended yesterday, and would like to thank all the organisers for making it happen. I posted some notes on topics we discussed at the conference to the mailing list, so you may find them interesting to read. I may post a more thorough report later on.
Now, to the main topic of this post. I was on Freenode's #perl the other day, when we were discussing how to count the number of lines in a file. Someone suggested opening the file and counting the lines while reading them from <$fh>. Someone else suggested trapping the output of wc -l. Then someone argued that trapping the output of wc -l is non-portable and will incur a costly fork. But is it slower?
To check, I created a very large text file using the following command:
locate .xml | grep '^/home/shlomi/Backup/Backup/2007/2007-12-07/disk-fs' | \
xargs cat > mega.xml
Here, I located all the files ending with .xml in my backup and concatenated them into a single file, "mega.xml". The statistics for this file are:
$ LC_ALL=C wc mega.xml
195594 1704386 17790746 mega.xml
So the file contains close to 200,000 lines and weighs about 17 MB. Then I ran the following benchmark using it:
#!/usr/bin/perl

use strict;
use warnings;

use Benchmark ':hireswallclock';

# Count the lines by trapping the output of wc -l.
sub wc_count
{
    my $s = `wc -l mega.xml`;
    $s =~ /^(\d+)/;
    return $1;
}

# Count the lines in pure Perl by reading the file line by line
# and reporting the last value of the line-number variable $. .
sub lo_count
{
    open my $in, "<", "mega.xml"
        or die "Cannot open mega.xml: $!";
    local $.;
    while (<$in>)
    {
    }
    my $ret = $.;
    close($in);
    return $ret;
}

# Sanity check: make sure both methods agree before benchmarking them.
if (lo_count() != wc_count())
{
    die "Error";
}

timethese(100,
    {
        'wc' => \&wc_count,
        'lo' => \&lo_count,
    }
);
The results?
shlomi:~/Download$ perl ../time-various-line-counts.pl
Benchmark: timing 100 iterations of lo, wc...
lo: 18.0495 wallclock secs (16.72 usr + 1.17 sys = 17.89 CPU) @ 5.59/s (n=100)
wc: 3.70755 wallclock secs ( 0.00 usr 0.03 sys + 1.77 cusr 1.91 csys = 3.71 CPU) @ 3333.33/s (n=100)
The wc method wins and is substantially faster. This is probably because wc is written in optimised C and so counts the lines faster, despite the overhead of forking a separate process.
For small files, the pure-Perl version wins, but for large files wc is better. Naturally, wc is not portable, which may be a deal-breaker in some cases.
The lesson from this is that forking processes or calling external programs is sometimes a reasonable thing to do (as MJD noted earlier in the link).
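To illustrate, here is a minimal sketch (not from the discussion above) of a counter that tries wc -l first and falls back to the pure-Perl loop when the external program is unavailable or its output cannot be parsed. The count_lines name and the error handling are my own choices:

sub count_lines
{
    my ($filename) = @_;

    # Try the external program first. The 2>/dev/null redirection assumes
    # a Bourne-compatible shell, which is itself a portability assumption.
    my $out = `wc -l "$filename" 2>/dev/null`;
    if (defined($out) && ($? == 0) && ($out =~ /^\s*(\d+)/))
    {
        return $1;
    }

    # Portable fallback: count the lines in pure Perl.
    open my $in, "<", $filename
        or die "Cannot open '$filename': $!";
    my $count = 0;
    $count++ while <$in>;
    close($in);
    return $count;
}

print count_lines("mega.xml"), "\n";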
But the results on my system (amd64-freebsd) look different: using a text file with nearly 200,000 lines, the wc version manages only about 22 iterations/second, much slower than on your system. The Perl version, on the other hand, seems to be faster than on your system: 9/s.
sub tr_count {
    # Setting $/ to a reference to 2**19 makes <$in> read fixed-length
    # records of 512 KiB instead of lines; localising $_ keeps the loop
    # from clobbering the caller's $_.
    local ( $/, $_ ) = \( 2**19 );
    my $c = 0;

    # $file is assumed to hold the test file's name (e.g. "mega.xml")
    # and to be declared in an enclosing scope.
    open my $in, "<", $file
        or die "Cannot open '$file': $!";

    # y/\n// counts the newlines in each chunk.
    $c += y/\n// while <$in>;
    return $c;
}
Only slightly slower than wc on my machine.
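For completeness, here is a sketch (mine, not from the comment) of how tr_count could be plugged into the benchmark from the post. It assumes $file is declared before tr_count's definition, so that the sub compiles under strict, and that it points at the same mega.xml file:

# Must appear before tr_count's definition for strict to accept the sub.
our $file = "mega.xml";

timethese(100,
    {
        'wc' => \&wc_count,
        'lo' => \&lo_count,
        'tr' => \&tr_count,
    }
);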